Goto

Collaborating Authors

 Guangdong Province




Humanoid workers and surveillance buggies: 'embodied AI' is reshaping daily life in China

The Guardian

On a misty Saturday afternoon in Shenzhen's Central Park, a gaggle of teenage girls are sheltering from the drizzle under a concrete canopy. With their bags of crisps piled high in front of them, they crowd around a couple of smartphones to sing along to Mandopop ballads. The sound of their laughter rings out across the surrounding lawn โ€“ until it is pierced by a mechanical buzzing sound. A few metres away from the impromptu karaoke session is an "airdrop cabinet", one of more than 40 in Shenzhen that is operated by Meituan, China's biggest food delivery platform. Hungry park-goers can order anything from rice noodles to Subway sandwiches to bubble tea.


U.K. raises alarm on Chinese drones used to survey sensitive sites

The Japan Times

U.K. government officials have raised private concerns that Chinese-manufactured drones are being used to take high resolution images of critical national infrastructure sites in the U.K., going against guidance from the country's security services. National Grid PLC, which operates the nation's electricity and gas networks, uses drones made by Shenzhen-based SZ DJI Technology to take videos, photographs and thermal images of its electricity substations, according to information posted on its website as recently as September. DJI drones have also been used to survey the construction of Electricite de France SA's Hinkley Point C nuclear power plant, to inspect solar farms, and by Thames Water to monitor reservoirs and the water supply. Deployment of the drones comes despite a warning in 2023 by the U.K.'s National Protective Security Authority (NPSA), part of the domestic security service MI5, that British organizations managing sensitive sites should be wary of using drones "manufactured in countries with coercive data sharing practices," a reference to China. Moreover, in 2022, the U.S. Department of Defense included DJI on a blacklist of Chinese firms with military ties.


TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

arXiv.org Artificial Intelligence

The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatically speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.


FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning

arXiv.org Artificial Intelligence

--Audio-Visual Question Answering (A VQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing A VQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. T o address these challenges, we first introduce a novel dataset, FortisA VQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-A VQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MA VEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisA VQA, with a notable improvement of 7.81%. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. UMANS possess the extraordinary capacity to seam-lessly integrate auditory and visual cues, effectively establishing a cohesive relationship between visual and auditory stimuli [1-3]. Jie Ma, Pinghui Wang, Jing Tao and Zhou Su are with the Ministry of Education of Key Laboratory for Intelligent Networks and Network Security, School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China. Zhitao Gao and Jun Liu are with the Shannxi Provincial Key Laboratory of Big Data Knowledge Engineering, School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China. Qi Chai is with the Information Hub, Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 510000, China. The question in current A VQA datasets is generated by a limited set of predefined templates, which may not be in line with the real-world scenario. Our findings indicate that existing methods such as STG [6] are not robust, which may be attributed to excessive bias learning, such as memorizing statistical regularities between critical question words and answers. It requires the system to learn high-order interaction representations of the concepts encompassed with audio, video, and language modalities. As is known to us [8-10], the high-level reasoning ability of the system mainly relies on large-scale data that does not contain harmful biases or statistical regularities. However, completely avoiding the negative bias in datasets seems challenging [11] due to the inherent skewness in real-world data distributions.


TRACE: Intra-visit Clinical Event Nowcasting via Effective Patient Trajectory Encoding

arXiv.org Artificial Intelligence

Electronic Health Records (EHR) have become a valuable resource for a wide range of predictive tasks in healthcare. However, existing approaches have largely focused on inter-visit event predictions, overlooking the importance of intra-visit nowcasting, which provides prompt clinical insights during an ongoing patient visit. To address this gap, we introduce the task of laboratory measurement prediction within a hospital visit. We study the laboratory data that, however, remained underexplored in previous work. We propose TRACE, a Transformer-based model designed for clinical event nowcasting by encoding patient trajectories. TRACE effectively handles long sequences and captures temporal dependencies through a novel timestamp embedding that integrates decay properties and periodic patterns of data. Additionally, we introduce a smoothed mask for denoising, improving the robustness of the model. Experiments on two large-scale electronic health record datasets demonstrate that the proposed model significantly outperforms previous methods, highlighting its potential for improving patient care through more accurate laboratory measurement nowcasting. The code is available at https://github.com/Amehi/TRACE.


KptLLM: Unveiling the Power of Large Language Model for Keypoint Comprehension

Neural Information Processing Systems

Recent advancements in Multimodal Large Language Models (MLLMs) have greatly improved their abilities in image understanding. However, these models often struggle with grasping pixel-level semantic details, e.g., the keypoints of an object. To bridge this gap, we introduce the novel challenge of Semantic Keypoint Comprehension, which aims to comprehend keypoints across different task scenarios, including keypoint semantic understanding, visual prompt-based keypoint detection, and textual prompt-based keypoint detection. Moreover, we introduce KptLLM, a unified multimodal model that utilizes an identify-then-detect strategy to effectively address these challenges. KptLLM underscores the initial discernment of semantics in keypoints, followed by the precise determination of their positions through a chain-of-thought process. With several carefully designed modules, KptLLM adeptly handles various modality inputs, facilitating the interpretation of both semantic contents and keypoint locations. Our extensive experiments demonstrate KptLLM's superiority in various keypoint detection benchmarks and its unique semantic capabilities in interpreting keypoints.


An Offline Adaptation Framework for Constrained Multi-Objective Reinforcement Learning Qian Lin

Neural Information Processing Systems

In recent years, significant progress has been made in multi-objective reinforcement learning (RL) research, which aims to balance multiple objectives by incorporating preferences for each objective. In most existing studies, specific preferences must be provided during deployment to indicate the desired policies explicitly. However, designing these preferences depends heavily on human prior knowledge, which is typically obtained through extensive observation of high-performing demonstrations with expected behaviors. In this work, we propose a simple yet effective offline adaptation framework for multi-objective RL problems without assuming handcrafted target preferences, but only given several demonstrations to implicitly indicate the preferences of expected policies. Additionally, we demonstrate that our framework can naturally be extended to meet constraints on safety-critical objectives by utilizing safe demonstrations, even when the safety thresholds are unknown. Empirical results on offline multi-objective and safe tasks demonstrate the capability of our framework to infer policies that align with real preferences while meeting the constraints implied by the provided demonstrations.


Unsupervised Anomaly Detection in The Presence of Missing Values

Neural Information Processing Systems

Anomaly detection methods typically require fully observed data for model training and inference and cannot handle incomplete data, while the missing data problem is pervasive in science and engineering, leading to challenges in many important applications such as abnormal user detection in recommendation systems and novel or anomalous cell detection in bioinformatics, where the missing rates can be higher than 30% or even 80%. In this work, first, we construct and evaluate a straightforward strategy, "impute-then-detect", via combining state-of-the-art imputation methods with unsupervised anomaly detection methods, where the training data are composed of normal samples only. We observe that such twostage methods frequently yield imputation bias from normal data, namely, the imputation methods are inclined to make incomplete samples "normal", where the fundamental reason is that the imputation models learned only on normal data and cannot generalize well to abnormal data in the inference stage. To address this challenge, we propose an end-to-end method that integrates data imputation with anomaly detection into a unified optimization problem. The proposed model learns to generate well-designed pseudo-abnormal samples to mitigate the imputation bias and ensure the discrimination ability of both the imputation and detection processes. Furthermore, we provide theoretical guarantees for the effectiveness of the proposed method, proving that the proposed method can correctly detect anomalies with high probability. Experimental results on datasets with manually constructed missing values and inherent missing values demonstrate that our proposed method effectively mitigates the imputation bias and surpasses the baseline methods significantly.